Observations

What do the data types look like?
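A minimal sketch of the dtype check, using a tiny stand-in frame in place of the real BankChurners data (column names assumed from the profiling notes below):

```python
import pandas as pd

# Tiny stand-in frame; the notebook inspects the full dataset here
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Customer_Age": [45, 49, 51],
    "Card_Category": ["Blue", "Blue", "Gold"],
    "Avg_Utilization_Ratio": [0.061, 0.105, 0.000],
})

print(df.dtypes)  # per-column dtypes
df.info()         # dtypes plus non-null counts and memory usage
```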

Observations

Let's look at missing values
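A sketch of the missing-value check (stand-in data; the real frame is the full dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Customer_Age": [45, 49, np.nan],
    "Education_Level": ["Graduate", None, "Uneducated"],
})

missing = df.isnull().sum()
print(missing)              # count of missing values per column
print(missing / len(df))    # share of missing values per column
```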

Observations

Let's look at unique values
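A sketch of the unique-value check (stand-in data):

```python
import pandas as pd

df = pd.DataFrame({
    "Card_Category": ["Blue", "Blue", "Gold", "Silver"],
    "Gender": ["M", "F", "F", "M"],
})

print(df.nunique())                  # distinct values per column
print(df["Card_Category"].unique())  # the actual levels
```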

Observations

What does the product distribution look like from what's pitched?
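The distribution can be sketched with `value_counts` (stand-in counts, not the real proportions):

```python
import pandas as pd

df = pd.DataFrame({"Card_Category": ["Blue"] * 6 + ["Silver"] * 2 + ["Gold", "Platinum"]})

counts = df["Card_Category"].value_counts()
shares = df["Card_Category"].value_counts(normalize=True)  # proportions
print(counts)
print(shares)
```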

Observations

Ranking of card categories by number of attrited customers:

Blue -> Silver -> Gold -> Platinum

It's clear that the Blue card does best in terms of volume, but it also has the highest number of churned customers.

Let's look at attrition rates.

The attrition rates by card category tell a slightly different story in terms of ranking:

Platinum -> Gold -> Blue -> Silver
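The two rankings (raw counts of attrited customers vs. attrition rates) can be sketched with a groupby; the frame below is a stand-in, with `Attrition_Flag` assumed coded 1 for an attrited customer:

```python
import pandas as pd

# Stand-in data: Attrition_Flag is 1 for an attrited customer, 0 otherwise
df = pd.DataFrame({
    "Card_Category": ["Blue", "Blue", "Blue", "Silver", "Gold", "Platinum"],
    "Attrition_Flag": [1, 0, 0, 0, 1, 1],
})

# Raw volume of attrited customers per category vs. the attrition *rate*
attrited_counts = df.groupby("Card_Category")["Attrition_Flag"].sum()
attrition_rates = df.groupby("Card_Category")["Attrition_Flag"].mean()
print(attrition_rates.sort_values(ascending=False))
```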

Average utilization rate is a little on the high side; under 20% would be considered good.

Univariate Analysis

Pandas-profiling Report Review

CLIENTNUM: Need to drop, record id.

Attrition_Flag: 16% churn.

Customer_Age: Normally distributed, though there's an odd spike around 50.

Gender: Split pretty evenly.

Dependent_count: Looks relatively normally distributed.

Education_Level: Customers seem relatively well educated, with over 50% holding higher education.

Marital_Status: Most customers are married, but single is a close second.

Income_Category: Most customers make less than $60k.

Card_Category: Blue is the largest category in terms of volume, with silver in second.

Months_on_book: Massive spike at 36 months, may need to deal with this.

Total_Relationship_Count: Flat from 3+, maybe not useful.

Months_Inactive_12_mon: Spike in 1,2,3 - may want to look into more.

Contacts_Count_12_mon: Seems normally distributed, but a notable share of individuals (~4%) have 0 contacts.

Credit_Limit: Unusual number of individuals with a high credit limit (~$34k).

Total_Revolving_Bal: Most people keep a revolving balance of 0, spike at $2500. Otherwise distribution looks okay.

Avg_Open_To_Buy: Significant right skewness at 1.66, should take logs.

Total_Amt_Chng_Q4_Q1: Going to look into outliers here.

Total_Trans_Amt: Looks like there are groups of observations here, may want to split into bins.

Total_Trans_Ct: Looks like there are groups of observations here, may want to split into bins.

Total_Ct_Chng_Q4_Q1: Going to look into outliers here.

Avg_Utilization_Ratio: Most utilization is 0, which makes sense given the number of people that maintain a balance of 0.

Let's clean up the gender variable, then drop a few unnecessary columns.
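A minimal sketch of the cleanup, assuming Gender is mapped to a 0/1 indicator and CLIENTNUM is the record id being dropped (stand-in data):

```python
import pandas as pd

df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],
    "Gender": ["M", "F"],
    "Customer_Age": [45, 49],
})

# Map Gender to a 0/1 indicator and drop the record id
df["Gender"] = df["Gender"].map({"M": 1, "F": 0})
df = df.drop(columns=["CLIENTNUM"])
print(df.head())
```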

Run some one hot encoding
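One-hot encoding can be sketched with `pd.get_dummies` (stand-in columns; the notebook's exact column list may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "Card_Category": ["Blue", "Silver", "Gold"],
    "Marital_Status": ["Married", "Single", "Married"],
})

# drop_first avoids perfectly collinear dummy columns
encoded = pd.get_dummies(df, columns=["Card_Category", "Marital_Status"], drop_first=True)
print(encoded.columns.tolist())
```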

Bivariate Analysis

We can confirm visually here that the marginal propensity to attrit flattens around the 3-4 month mark.

Feature Engineering

Take logs to deal with some of the attribute skewness
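For the right-skewed attributes noted above (e.g. Avg_Open_To_Buy), a log transform can be sketched as follows; `log1p` is used here so zero values are handled safely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Avg_Open_To_Buy": [500.0, 1500.0, 34000.0]})

# log1p handles zeros safely and compresses the right tail
df["Avg_Open_To_Buy_log"] = np.log1p(df["Avg_Open_To_Buy"])
print(df)
```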

Other transformations

Going to convert total transaction amount and count to categorical, since observations appear to be clustered around certain levels.
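The binning can be sketched with `pd.qcut`; the bin count and labels below are illustrative, not the notebook's exact cut points:

```python
import pandas as pd

df = pd.DataFrame({"Total_Trans_Amt": [800, 2500, 4500, 8000, 15000]})

# qcut puts roughly equal numbers of observations in each bin;
# q=4 and these labels are assumptions for illustration only
df["Total_Trans_Amt_bin"] = pd.qcut(df["Total_Trans_Amt"], q=4,
                                    labels=["low", "mid", "high", "very_high"])
print(df)
```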

Next, OHE the newly created categories.

Check the data and drop unnecessary columns.

Going to drop a few records that have missing values.

Only 7 records dropped - this is a marginal amount of lost information.

Split Data
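A sketch of the split on synthetic data; stratifying preserves the ~16% churn rate in both splits (test size and seed are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"Customer_Age": range(100), "Total_Trans_Ct": range(100, 200)})
y = pd.Series([0] * 84 + [1] * 16)  # ~16% churn, as observed above

# Stratify so the churn rate is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)
print(len(X_train), len(X_test))
```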

Define some functions

Model scoring

Confusion Matrix
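A hypothetical helper mirroring the kind of scoring functions defined here, focused on the recall and ROC-AUC metrics chosen below (the function name and return shape are assumptions):

```python
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Hypothetical scoring helper; the notebook's actual function may differ
def score_model(y_true, y_pred, y_proba):
    return {
        "recall": recall_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_proba),
    }

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_proba = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

print(score_model(y_true, y_pred, y_proba))
print(confusion_matrix(y_true, y_pred))  # rows: actual, cols: predicted
```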

Model Building



For this analysis, we're going to assume that the cost of losing a customer is high and that there are no associated marketing costs, since the organization will focus on service improvements. As such, the focus will be on recall and ROC-AUC.

Model 1 - Logistic Regression

Let's evaluate the model performance by using KFold and cross_val_score
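A sketch of the cross-validated evaluation on synthetic data with a similar class imbalance (the fold count and scorer are assumptions, chosen to match the recall focus above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly the 16/84 class split seen in the data
X, y = make_classification(n_samples=300, weights=[0.84], random_state=1)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=kfold, scoring="recall")
print(scores.mean(), scores.std())
```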


Oversampling train data using SMOTE

Logistic Regression on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score


Regularization

Undersampling train data (SMOTE itself only oversamples; undersampling discards majority-class rows instead)

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

We've been able to significantly improve the performance of the model with up- and down-sampling. Undersampled data is the top performer among the logistic regression models.

Model 2 - Bagging Classifier

Model 3 - Random Forest

Model 4 - Decision Tree w/ Pruning

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
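The pruning steps described above can be sketched as follows on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=300, weights=[0.84], random_state=1)

# Effective alphas at which subtrees get pruned away
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# One tree per effective alpha
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
        for a in ccp_alphas]

# The last alpha prunes everything down to the trivial one-node tree, so drop it
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
print([clf.tree_.node_count for clf in clfs])
```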

Recall vs alpha for training and testing sets

Model 5 - AdaBoost

Model 6 - Gradient Boosting

Model 7 - XGBoost

Aggregate Output

Grid Search Hyperparameter Tuning
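A sketch of the grid search pattern used for the tuned models below, on synthetic stand-in data (the grid itself is a small illustrative assumption, not the notebook's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, weights=[0.84], random_state=1)

# Small illustrative grid; the notebook's actual grids are larger
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```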

Model 8 - Tuned Logistic Regression w/ Under Sampling

Model 9 - Tuned Logistic Regression w/ Over Sampling

Model 10 - XGBoost Tuned

Random Search Hyperparameter Tuning
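A sketch of the randomized search pattern, which samples parameter combinations rather than exhaustively enumerating them; a `RandomForestClassifier` stands in here for the notebook's estimators, and the distributions are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, weights=[0.84], random_state=1)

# Sample n_iter parameter combinations from these distributions
param_dist = {"n_estimators": randint(50, 200), "max_depth": randint(2, 10)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=1), param_dist,
                            n_iter=5, scoring="recall", cv=3, random_state=1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```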

Model 11 - Tuned Logistic Regression w/ Under Sampling

Model 12 - Tuned Logistic Regression w/ Over Sampling

Model 13 - XGBoost Tuned

Summary

Overall, the model results were interesting: we clearly saw the impact of over- and under-sampling the data, as well as of hyperparameter tuning and the use of pipelines. We end up with a very strong model based on recall: XGBoost tuned with randomized search CV. Many models performed well, but this one outperformed all others.

The organization now has a strong model for predicting which customers are at risk of churn.